Abstract: We consider the least angle regression and forward stagewise
algorithms for solving penalized least squares regression problems. In Efron,
Hastie, Johnstone & Tibshirani (2004) it is proved that the least angle
regression algorithm, with a small modification, solves the lasso regression
problem. Here we give an analogous result for incremental forward stagewise
regression, showing that it solves a version of the lasso problem that
enforces monotonicity. One consequence of this is as follows: while lasso
makes optimal progress in terms of reducing the residual sum-of-squares
per unit increase in $L_1$-norm of the coefficient $\beta$, forward stagewise is
optimal per unit $L_1$ arc-length traveled along the coefficient path. We also
study a condition under which the coefficient paths of the lasso are
monotone, and hence the different algorithms coincide. Finally, we compare the
lasso and forward stagewise procedures in a simulation study involving a
large number of correlated predictors.
The lasso (Tibshirani 1996) is a method for regularizing a least squares
regression. Suppose we have predictor measurements $x_{ij}$, $j = 1, 2, \ldots, p$ and an
outcome measurement $y_i$, observed for cases $i = 1, 2, \ldots, N$. The lasso fits a
linear model

$$f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j \tag{1}$$

by solving the optimization problem

$$\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s. \tag{2}$$

If the tuning parameter $s \ge 0$ is large enough, this gives the ordinary least
squares estimates. However, smaller values of $s$ produce shrunken estimates $\hat\beta$,
often with many components equal to zero. Choosing $s$ can be thought of as
choosing the number of predictors to include in a regression model. Thus the
lasso can select predictors like subset selection methods. However, since it is
a smooth optimization problem, it is less variable than subset selection, and
can be applied to much larger problems (large in $p$). Chen, Donoho & Saunders
(1998) developed related technology in the context of signal processing.



The criterion (2) leads to a quadratic programming problem for each $s$, and
thus standard numerical-analysis methods can be used to solve it. Figure 1
shows an example, based on simulated data with 10 predictors (details of the
data generation and model are given later). The top panel shows the coefficient
profiles of the lasso solutions, as the bound $s = \sum_j |\beta_j|$ is increased from 0 up
to the point where the full least squares solutions are obtained (right end of
figure).

[Figure 1. Panel titles: Lasso; Forward stagewise. Horizontal axis: L1 Norm (Standardized).]

Fig 1. Coefficient profiles, simulated example. The $L_1$ norm is computed on the standardized
variables. The coefficients are given on their original scale, on which the details are more
visible. The lasso starts to differ from LAR at the broken vertical line, when the gray
coefficient passes through zero. Forward stagewise starts to differ from LAR and lasso at the
dotted vertical line, where the gray coefficient goes flat instead of turning back towards zero.

Notice the piecewise linear nature of the lasso profiles. Efron et al. (2004)
exploited this fact to derive a simple algorithm, least angle regression, for
simultaneously solving the entire set of lasso problems (all values of $s$). This
work was motivated by an observation in Hastie, Tibshirani & Friedman (2001,
Section 10.12.2) that the lasso profile bore a striking similarity to the coefficient
profile produced by a version of boosting for linear models, which they named
the incremental forward stagewise algorithm (hereafter FS$_\epsilon$).

This FS$_\epsilon$ algorithm (see Algorithm 1) creates a coefficient profile as follows:
at each step it increments the coefficient of that variable most correlated with
the current residuals by an amount $\pm\epsilon$, with the sign determined by the sign of
the correlation. Efron et al. (2004) in fact considered the limiting version of this
algorithm, with $\epsilon \downarrow 0$, which also has piecewise linear coefficient paths. We refer
to this as the infinitesimal forward stagewise algorithm, hereafter FS$_0$ or simply
forward stagewise. Efron et al. (2004) showed that under certain conditions, this
FS$_0$ path is identical to the lasso path. However, for most problems they are
different (e.g. Figure 1), and sometimes strikingly so (Figure 7). The FS$_0$ paths
are much smoother than the lasso paths.

The primary result in this paper is the characterization of FS$_0$ as a monotone
version of the lasso, in a sense to be described in Section 3. As such, it is a more
restricted version of the lasso, and hence the additional smoothness. Because of
the monotonicity, the criterion cannot be defined pointwise (as lasso can), but
instead defines the entire path via a differential equation.

In Section 4 we generalize this characterization to loss functions other than
squared error. Section 5 considers other candidate criteria, and Section 6
examines conditions under which the lasso and FS$_0$ are the same. Section 7 compares
the two procedures in a simulation study. Some proofs are given in the Appendix.



2. Background: The LARS Algorithm 



Hastie et al. (2001) showed that the solution path for the lasso is strikingly
similar to that of a simplified version of "boosting". Boosting is an adaptive,
non-linear, function-fitting method that has received much attention in
the past ten years (Schapire & Freund 1997, Schapire, Freund, Bartlett & Lee
1998, Friedman, Hastie & Tibshirani 2000). In modern versions of boosting, the
set of "variables" is a large space of binary trees, which are selected, shrunk,
and added to the current model. In their simplification, Hastie et al. (2001)
replaced boosted trees by an incremental forward stagewise algorithm for linear
regression, reproduced here as Algorithm 1.



The Least Angle Regression algorithm (LAR, see Algorithm 2) of Efron et al.
(2004) was initially intended as the limiting version FS$_0$ of FS$_\epsilon$. As we explain
below, LAR is different from both FS$_0$ and lasso, but both can be obtained
through simple modifications. The bottom panel of Figure 1 shows the coefficient
profiles of FS$_0$. Notice that they are similar to lasso and LAR, but tend to be
smoother.

Algorithm 1 Incremental Forward Stagewise Regression: FS$_\epsilon$

1. Start with $r = y - \bar{y}$, $\beta_1, \beta_2, \ldots, \beta_p = 0$.

2. Find the predictor $x_j$ most correlated with $r$.

3. Update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}[\mathrm{corr}(r, x_j)]$.

4. Update $r \leftarrow r - \delta_j x_j$, and repeat steps 2 and 3 until no predictor has any
correlation with $r$.
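The four steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `forward_stagewise`, the argument layout (predictors as a list of columns) and the stopping rule are assumptions made here. With a finite step $\epsilon$ the correlations never reach exactly zero, so the sketch stops once a further step of size $\epsilon$ would no longer reduce the residual sum of squares. Predictors are assumed centered and scaled to a common norm, so that inner products rank them the same way correlations do.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def forward_stagewise(columns: List[List[float]], y: List[float],
                      eps: float = 0.01, max_steps: int = 10000) -> List[float]:
    """Incremental forward stagewise regression (FS_eps), Algorithm 1.

    columns: p predictor columns (each of length N), assumed centered and
    scaled to a common norm. Returns the coefficient vector.
    """
    y_bar = sum(y) / len(y)
    r = [yi - y_bar for yi in y]              # step 1: r = y - mean(y)
    beta = [0.0] * len(columns)               # step 1: all coefficients zero
    for _ in range(max_steps):
        scores = [dot(x, r) for x in columns]
        # step 2: predictor most correlated (in absolute value) with r
        j = max(range(len(columns)), key=lambda k: abs(scores[k]))
        # assumed stopping rule: a step of size eps no longer lowers the RSS
        if abs(scores[j]) <= 0.5 * eps * dot(columns[j], columns[j]):
            break
        delta = eps if scores[j] > 0 else -eps            # step 3
        beta[j] += delta
        r = [ri - delta * xi for ri, xi in zip(r, columns[j])]  # step 4
    return beta
```

For two orthogonal predictors, `forward_stagewise([[1, -1, 1, -1], [1, 1, -1, -1]], [4, 0, 6, 2])` ends within one step size of the least squares coefficients $(2, -1)$.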

LAR is a kind of "democratic" alternative to a version of the commonly 
used forward-stepwise regression algorithm. Forward-stepwise regression starts 
with all coefficients equal to zero, and then builds a sequence of models by 
successively including one variable at a time, and updating the least-squares fit. 
The version we consider here enters at each stage the variable most correlated
with the residuals.² This process is repeated until all $p$ predictors have been
entered, or the residuals are zero.

LAR uses a similar strategy, but only enters "as much" of a predictor as 
it "deserves" : the coefficient of the predictor is increased only up to the point 
where some other predictor has as much correlation with the current residual. 
This new predictor is entered, and the process is continued. Algorithm [2] gives 
more details. 



Algorithm 2 LARS¹: Least Angle Regression

1. Standardize the predictors to have mean zero and variance 1. Start with
the residual $r = y - \bar{y}$, $\beta_1, \beta_2, \ldots, \beta_p = 0$.

2. Find the predictor $x_j$ most correlated with $r$.

3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r \rangle$, until some
other competitor $x_k$ has as much correlation with the current residual as
does $x_j$.

4. Move $(\beta_j, \beta_k)$ in the direction defined by their joint least squares coefficient
of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as
much correlation with the current residual.

5. Continue in this way until all $p$ predictors have been entered. After $p$ steps,
we arrive at the full least-squares solution.



¹The name "LARS", derived from least angle regression and lasso, is the name we have
given to our algorithm that implements LAR, lasso and FS$_0$.

²This can differ from the more traditional version, which includes the variable that leads
to the largest drop in residual sum-of-squares.

T. Hastie et al./The monotone lasso

The profiles for LAR are shown in the middle panel of Figure 1. They look
similar to the lasso solutions, especially in the beginning. The first discrepancy
is at the place marked by a vertical broken line in the lasso profiles. The LAR
profile passes through zero at this point, while the lasso profile hits zero, and
stays there. This similarity is no coincidence. It turns out that with one
modification, the LAR procedure exactly produces the set of lasso solutions for all
$s$. The modification needed is as follows:



Algorithm 2a LARS: lasso Modification.

5a. If a non-zero coefficient hits zero, drop it from the active set and recompute
the current joint least squares direction.



The LARS(lasso) algorithm is extremely efficient, requiring the same order of
computation as that of a single least squares fit using the $p$ predictors. Least
angle regression always takes $p$ steps to get to the full least squares estimates. The
lasso path can have more than $p$ steps, although the two are often quite similar.
Algorithm 2a is an efficient way of computing the solution to any lasso problem,
especially when $N \ll p$ (Donoho & Tsaig 2006). Osborne, Presnell & Turlach
(2000) also discovered a piecewise-linear path for computing the lasso, which
they called a homotopy algorithm.

Efron et al. (2004) showed that another variant of the LAR algorithm gives
FS$_0$; see also Theorem 1. Suppose we have reached a point when a
variable enters the active set $A$; all variables $x_j$, $j \in A$ have correlation equal in
magnitude with the current residual $r$. Then the new LAR direction is defined
by the least-squares fit of $r$ on $X_A$. The modification needed to achieve FS$_0$
replaces this by a type of non-negative least squares direction.

Algorithm 2b LARS: FS$_0$ Modification.

4. Find the new direction by solving the constrained least squares problem

$$\min_b \|r - X_A b\|_2^2 \quad \text{subject to } b_j s_j \ge 0,\ j \in A,$$

where $s_j$ is the sign of $\langle x_j, r \rangle$.

The constraints arise from the fact that in step 3 of the incremental forward
stagewise procedure, the coefficient of each predictor is increased in the direction
of its correlation with the current residual. In Figure 1 LAR and FS$_0$ start to
differ (vertical dotted line) when the third variable enters $A$.
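The sign-constrained least squares problem in the FS$_0$ modification can be solved in several ways. The sketch below uses cyclic coordinate descent, which is a choice made here for brevity and is not the method used by the authors; the function name `sign_constrained_ls` and the fixed number of sweeps are likewise assumptions. Each coordinate update is the exact one-dimensional minimizer, clipped to zero when it would violate $b_j s_j \ge 0$.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sign_constrained_ls(columns: List[List[float]], r: List[float],
                        signs: List[int], sweeps: int = 500) -> List[float]:
    """Minimize ||r - sum_j b_j x_j||^2 subject to b_j * s_j >= 0.

    columns: the active predictors x_j (nonzero columns of X_A).
    signs:   s_j in {+1, -1}, the sign of <x_j, r>.
    Cyclic coordinate descent with a fixed number of sweeps (assumption).
    """
    b = [0.0] * len(columns)
    resid = list(r)                               # r - X_A b for the current b
    for _ in range(sweeps):
        for j, (x, s) in enumerate(zip(columns, signs)):
            # exact minimizer over b_j with the other coordinates held fixed
            new = b[j] + dot(x, resid) / dot(x, x)
            if new * s < 0:                       # wrong sign: clip to zero
                new = 0.0
            step = new - b[j]
            resid = [ri - step * xi for ri, xi in zip(resid, x)]
            b[j] = new
    return b
```

With `columns = [[1, 0], [0.8, 0.6]]` and `r = [1, -0.5]`, both predictors correlate positively with `r`, but the unconstrained fit gives the second one a negative coefficient; under the constraint (`signs = [1, 1]`) that coefficient is held at zero. This small example only illustrates the constraint and is not a point on an actual FS$_0$ path.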

The top left panel of Figure 3 shows the residual sum of squares
(RSS) for each of the three procedures, as a function of the $L_1$ norm of the
coefficient vector. As expected, the lasso curve lies below the other two, because
it decreases the RSS the fastest per unit increase in $L_1$ norm. The right panel
plots RSS against the $L_1$ arc-length of the coefficient profile; here FS$_0$ wins,
a point we enlarge on in the next section.

In summary, we see that the forward stagewise and LAR algorithms "nearly"
solve the $L_1$-penalized regression problem. It is natural to ask: what problems
are the forward stagewise and LAR algorithms solving? Keith Knight asked that
question in the discussion of Efron et al. (2004).
We provide some answers to the first question in this paper. We make three
main contributions:

• we characterize forward stagewise as a monotone version of the lasso, in 
an extended space of variables consisting of each variable and its negative. 

• we study a condition under which the profiles of all three methods are 
monotone, and hence the three methods coincide. 

• we compare the lasso and forward stagewise procedures in a simulation 
study involving a large number of correlated predictors. 

3. Forward Stagewise and the Monotone Lasso 

In this section we consider an expanded representation of the lasso problem
which facilitates a clearer understanding of the forward stagewise procedure. For
each predictor $x_j$, we include its negative version $-x_j$, resulting in an expanded
data set with $2p$ predictors. In matrix notation we create an expanded data
matrix $\tilde{X} = [X : -X]$. In this framework the lasso problem becomes

$$\min_{\beta_0,\,\beta_j^+,\,\beta_j^-} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \Big[ \sum_{j=1}^{p} x_{ij}\beta_j^+ - \sum_{j=1}^{p} x_{ij}\beta_j^- \Big] \Big)^2 \quad \text{subject to } \beta_j^+, \beta_j^- \ge 0\ \forall j \text{ and } \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) \le s. \tag{3}$$

In what follows we will sometimes suppress the constant term $\beta_0$, which can
always be removed once and for all by centering all the variables. Since each
value of the bound $s$ characterizes a solution, we can use $s$ to index the solution
$\beta(s)$; whenever the constraint is active, the solution satisfies $\|\beta(s)\|_1 = s$, and
we say the solution profile is parametrized by $L_1$-norm. Problem (3) is
equivalent to a standard representation for solving the lasso problem by quadratic
programming, and the KKT conditions ensure that at most one of $\beta_j^+$ and $\beta_j^-$
is greater than zero at the same time. Hence by augmenting the data with the
negative of the variables, the positive lasso in the enlarged space is equivalent
to the original lasso problem. Figure 2 (top pair of panels) shows the coefficient
paths of the positive and negative variables for the lasso solution in Figure 1.

In the lower pair of plots, an additional constraint is imposed on this sequence
of lasso problems: the coefficient paths are constrained to be monotone
non-decreasing. These monotone paths are exactly equivalent to the paths of the
forward-stagewise algorithm. By this we mean that the collapsed versions of
the paths (subtracting the coefficients for the negative versions of the variables
from the corresponding coefficients for the positive versions) are exactly the
forward-stagewise paths in the lower panel in Figure 1.

This leads us to characterize the forward-stagewise algorithm as a monotone
version of the lasso. These extra restrictions are an additional form of
regularization, leading to smoother coefficient profiles.
This expanded space of variables creates a more natural analog of boosting,
which operates in a large dictionary of binary trees. For every tree, its negative
is also available.³ In the expanded space the equivalent of Algorithm 1 is given
in Algorithm 3. It is obvious that Algorithm 3 generates monotone coefficient
paths, indexed by the number of steps $m$, or the total distance stepped $t = m \cdot \epsilon$.
Drawing on the results of Efron et al. (2004), we show in Theorem 1 that the
limit as $\epsilon \downarrow 0$ leads exactly to the monotone representation as in Figure 2.

Algorithm 3 Monotone Incremental Forward Stagewise Regression

1. Start with $r = y - \bar{y}$, $\beta_1, \beta_2, \ldots, \beta_{2p} = 0$.

2. Find the predictor $x_j$ most positively correlated with $r$.

3. Update $\beta_j \leftarrow \beta_j + \epsilon$.

4. Update $r \leftarrow r - \epsilon x_j$, and repeat steps 2 and 3 until no predictor has any
correlation with $r$.

First we define the notion of $L_1$ arc-length.

Definition 1. Suppose $\beta(t)$ is a one-dimensional differentiable curve in $t \ge 0$,
with $\beta(0) = 0$. The $L_1$ arc-length of $\beta(t)$ in $[0, t]$ is given by

$$\mathrm{TV}(\beta, t) = \int_0^t \|\dot\beta(s)\|_1 \, ds, \tag{4}$$

where $\dot\beta(s) = d\beta(s)/ds$.

We have named the arc-length "TV" for total-variation; the $L_1$ arc-length of
$\beta(t)$ up to time $t$ is the sum of total variation measures for each of its coordinate
functions, and is a measure of roughness of the curve.

For a piecewise-differentiable continuous curve, the arc-length is the sum of
the arc-lengths of the differentiable pieces. The following lemma is easily proved:

Lemma 1. If the coordinates of $\beta(t)$ are monotone and piecewise differentiable
in $t$, then $\mathrm{TV}(\beta, t) = \|\beta(t)\|_1$.

Hence the arc-length and $L_1$ norm for a monotone coefficient profile are the
same.
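For a piecewise-linear path, the integral in the definition of $L_1$ arc-length reduces to a sum of $L_1$ distances between consecutive knots. The following sketch (the function names `l1_arc_length` and `l1_norm` are illustrative, not from the paper) checks Lemma 1 numerically on a small monotone path and shows that the two quantities separate once a coordinate turns back.

```python
from typing import List


def l1_norm(beta: List[float]) -> float:
    """L1 norm of a coefficient vector."""
    return sum(abs(b) for b in beta)


def l1_arc_length(knots: List[List[float]]) -> float:
    """L1 arc-length of the piecewise-linear path through the given knots.

    On a linear segment the integral of ||d beta / ds||_1 equals the
    L1 distance between the segment's end points.
    """
    total = 0.0
    for prev, cur in zip(knots, knots[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, cur))
    return total


# Monotone path starting at 0: arc-length equals the L1 norm of the end point.
monotone = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.5]]
# Non-monotone path: the first coordinate rises to 2 and then falls back to 1.
backtracking = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
```

Here `l1_arc_length(monotone)` and `l1_norm(monotone[-1])` are both 3.5, while for the backtracking path the arc-length is 4.0 and the $L_1$ norm of the end point is 2.0.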

Although it is convenient to use this expanded representation $\tilde{X}$, we can
always collapse to the original representation $X$. The coefficients in the original
representation are simply the paired differences $\beta_j(t) = \beta_j^+(t) - \beta_j^-(t)$. Note
that

• The $L_1$ norm for the lasso coefficients is the same in either representation,
since only one coefficient in each pair is non-zero at a time (see the proof
for part 1 of Theorem 1).

• The $L_1$ arc-length for FS$_0$ is the $L_1$ norm in the expanded representation,
and is equal to the $L_1$ arc-length in the original representation. This is
NOT the same as the $L_1$ norm in the original representation.
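The monotone version of incremental forward stagewise in the expanded space, together with the collapse back to the original representation, can be sketched as follows. As with any finite-step version, the stopping rule is an assumption made here (stop when a step of size $\epsilon$ no longer reduces the RSS), and the names `monotone_forward_stagewise`, `expanded` and `collapsed` are illustrative. Predictors are assumed centered and scaled to a common norm.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def monotone_forward_stagewise(
        columns: List[List[float]], y: List[float],
        eps: float = 0.01, max_steps: int = 10000
) -> Tuple[List[float], List[float], int]:
    """Monotone incremental forward stagewise on the expanded data [X : -X].

    Returns (expanded coefficients of length 2p, collapsed coefficients of
    length p, number of steps taken).
    """
    p = len(columns)
    expanded = columns + [[-v for v in x] for x in columns]   # [X : -X]
    y_bar = sum(y) / len(y)
    r = [yi - y_bar for yi in y]
    beta = [0.0] * (2 * p)
    steps = 0
    for _ in range(max_steps):
        scores = [dot(x, r) for x in expanded]
        # predictor most positively correlated with the residual
        j = max(range(2 * p), key=lambda k: scores[k])
        # assumed stopping rule: a step of size eps no longer lowers the RSS
        if scores[j] <= 0.5 * eps * dot(expanded[j], expanded[j]):
            break
        beta[j] += eps                          # coefficients only ever increase
        r = [ri - eps * xi for ri, xi in zip(r, expanded[j])]
        steps += 1
    collapsed = [beta[j] - beta[j + p] for j in range(p)]     # beta+ minus beta-
    return beta, collapsed, steps
```

By construction the expanded coefficients are non-negative and non-decreasing, and their sum equals `steps * eps`, the total distance stepped; this is the $L_1$ norm in the expanded representation and the $L_1$ arc-length of the collapsed path.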



³We can think of a tree as a variable; the $N$ values are obtained by passing the training
data down to the terminal nodes.
[Figure 2. Panel titles: Lasso; Monotone Lasso. Horizontal axis: L1 Norm (Standardized).]

Fig 2. Expanded coefficient profiles, simulated example. The $L_1$ norm is computed on the
standardized variables. The coefficients are given on their original scale, on which the details
are more visible.
Every point along the lasso path is the solution to a convex optimization
problem. Unfortunately, the monotonicity restriction of the forward stagewise
path appears to preclude such a succinct characterization. Alternatively, we
can show that the lasso path is the solution to a differential equation, which
characterizes the path in terms of a series of optimal moves. We then show
that the forward stagewise path is the solution to a closely related differential
equation, which restricts these optimal moves to be monotone. In the remainder
of this section:

• we characterize the forward stagewise path in terms of a sequence of
monotone moves, and compare these moves to the less restrictive moves of the
lasso (Theorem 1);

• this leads us to define the monotone lasso, a path defined by a differential
equation, with the derivatives giving the move directions from the
current position (Definitions 2 and 3). The lasso can also be characterized as
a solution to a related differential equation;

• we show that the monotone lasso is locally optimal in terms of arc-length:
it makes the optimal move per unit increase in arc-length of the coefficient
profile. The lasso makes the optimal move per unit increase in the $L_1$ norm
of the coefficients (Theorem 2);

• we show that the forward stagewise algorithm computes the solution to
the monotone lasso criterion (Proposition 1).

We then generalize these results for other loss functions in Section 4.

Theorem 1. Let $\beta^0 \in \mathbb{R}^{2p}$ be a point either on the lasso or forward stagewise
path in the expanded-variable space, and let $A$ be the active set of variables
achieving the maximal correlation with the current residual $r = y - \tilde{X}\beta^0$.

1. The lasso coefficients move in a direction given by the coefficients of the
least squares fit of $r$ on $\tilde{X}_A$.

2. The forward stagewise coefficients move in a direction given by the
coefficients of the non-negative least squares fit of $r$ on $\tilde{X}_A$.

In either case only the coefficients in $A$ change, and this fixed direction is
pursued until the first of the following events occurs:

(a) a variable not in $A$ attains the maximal correlation and joins $A$;

(b) the coefficient of a variable in the active set reaches 0, at which point it
leaves $A$ (lasso only);

(c) the residuals match those of the unrestricted least squares fit.

When (a) or (b) occur, the direction is recomputed.

The proof of this theorem can be assembled from the results proved in
Efron et al. (2004). For convenience we give a simple proof in the appendix,
using convex optimality conditions (see also Rosset & Zhu (2004)).

Theorem 1 leads us to define the monotone lasso as the solution to a
differential equation, which is characterized in terms of its positive path derivatives.
The FS$_0$ algorithm computes this solution.
Theorem 1 is stated in terms of a point on the lasso/FS$_0$ paths. In fact
these moves can be defined starting from any value $\beta$.

Definition 2. Let $\beta \in \mathbb{R}^{2p}$ be any coefficient for a linear model in the expanded
variable set, and let $r = y - \tilde{X}\beta$. Let $A$ be the active set of variables achieving
maximal correlation with $r$.

1. The lasso move direction $\rho_l(\beta) : \mathbb{R}^{2p} \to \mathbb{R}^{2p}$ is defined

$$\rho_l(\beta) = \begin{cases} 0 & \text{if } \tilde{X}^T r = 0 \\ \theta / \sum_j \theta_j & \text{otherwise,} \end{cases} \tag{5}$$

with $\theta_j = 0$ except for $j \in A$, where $\theta_A$ is the least squares coefficient of $r$
on $\tilde{X}_A$.

2. The monotone lasso move direction $\rho_{ml}(\beta) : \mathbb{R}^{2p} \to \mathbb{R}^{2p}$ is defined

$$\rho_{ml}(\beta) = \begin{cases} 0 & \text{if } \tilde{X}^T r = 0 \\ \theta / \sum_j \theta_j & \text{otherwise,} \end{cases} \tag{6}$$

with $\theta_j = 0$ except for $j \in A$, where $\theta_A$ is the non-negative least squares
coefficient of $r$ on $\tilde{X}_A$.

The normalizations in (5) and (6) are not essential, but turn out to be
convenient when we parametrize the coefficient paths later in this section.

Figure 3 shows the residual-sum-of-squares (RSS) curves for the lasso and
forward stagewise algorithms, applied to our simulation example. It appears in
this example that lasso decreases RSS most rapidly as a function of the $L_1$
norm of the coefficients, while forward stagewise wins in terms of $L_1$
arc-length. It turns out that this is always the case, and is a characterization of
the local optimality for each of the procedures.

Theorem 2. Let $\beta^0 \in \mathbb{R}^{2p}$ be a coefficient vector in the expanded-variable space.
Then the lasso/monotone lasso move directions defined in Definition 2 are
optimal in the sense that

1. A lasso move decreases the residual sum of squares at the optimal quadratic
rate with respect to the $L_1$ coefficient norm;

2. A monotone-lasso move decreases the residual sum of squares at the
optimal quadratic rate with respect to the coefficient $L_1$ arc-length.

There is some intuition in this distinction when we think of forward stagewise
as a form of boosting. There we pay a cost in terms of effort for any move we
make (number of trees), which is captured by arc-length. With the lasso we get
rewarded for decreasing a coefficient towards zero. The monotonicity constraint
also results in much smoother coefficient profiles, and hence shorter arc-lengths.
Zhao & Yu (2004) propose a modification to boosting to allow the backtracking
needed to make FS$_0$ coincide with lasso.

Our proof follows closely the material in Section 6 of Efron et al. (2004).
Since the directions are fixed while $A$ is fixed, the paths are piecewise linear,
and hence the residual-sum-of-squares curves are piecewise quadratic.
Fig 3. The RSS for our simulation example, as a function of the $L_1$ norm (left panel) and
arc length (right panel) of the coefficient paths for lasso, forward stagewise, and Least Angle
Regression.



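Proof of Theorem 2, lasso. Consider a move in direction $d$ from $\beta^0$: $\beta^0 + \gamma \cdot d$.
Define

$$R(\gamma) = \|y - \tilde{X}(\beta^0 + \gamma \cdot d)\|_2^2; \tag{7}$$

$$T(\gamma) = (\beta^0 + \gamma \cdot d)^T \mathbf{1}, \tag{8}$$

where $\mathbf{1} = (1, \ldots, 1)^T$. Assuming $d_j \ge 0$ when $\beta_j^0 = 0$, $T(\gamma)$ is the $L_1$ norm of
the changed coefficient. We then compute the path derivative

$$U(\gamma) = \frac{dR}{dT} = \frac{\partial R}{\partial \gamma} \Big/ \frac{\partial T}{\partial \gamma} = -2\, \frac{d^T \tilde{X}^T \big(y - \tilde{X}(\beta^0 + \gamma \cdot d)\big)}{d^T \mathbf{1}},$$

$$U(\gamma)\big|_{\gamma = 0} = -2\, \frac{d^T \tilde{X}^T r}{d^T \mathbf{1}}, \tag{11}$$

where $r = y - \tilde{X}\beta^0$. Since $x_j^T r = C$ for all $j \in A$, the maximal-correlation active
set, this derivative is minimized by allowing only those elements $d_j$ with $j \in A$
to be nonzero. For any such $d = d_A$, the derivative is $U(0) = -2C$. Among
these, we seek the $d_A$ with smallest Hessian

$$\frac{d^2 R}{dT^2} = \frac{dU}{dT} = \frac{\partial U}{\partial \gamma} \Big/ \frac{\partial T}{\partial \gamma} = 2\, \frac{d_A^T \tilde{X}_A^T \tilde{X}_A d_A}{(d_A^T \mathbf{1})^2}.$$

Minimizing this Rayleigh quotient is equivalent to minimizing

$$d_A^T \tilde{X}_A^T \tilde{X}_A d_A \quad \text{subject to } d_A^T \mathbf{1} = 1.$$

It is straightforward to show that the solution is given by $d_A \propto (\tilde{X}_A^T \tilde{X}_A)^{-1} \mathbf{1}$.
But since $\tilde{X}_A^T r = C \mathbf{1}$, this is equivalent to the lasso move.

Hence the sequence of lasso moves results in an optimal piecewise-quadratic
RSS drop-off curve as a function of $L_1$ norm.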
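The "straightforward" step in the lasso part of the proof can be spelled out with a Lagrange multiplier. The following is a sketch under the assumption that $\tilde{X}_A^T \tilde{X}_A$ is invertible; the multiplier $\lambda$ is notation introduced here and is fixed afterwards by the constraint $d_A^T \mathbf{1} = 1$.

```latex
\begin{align*}
\mathcal{L}(d_A, \lambda)
  &= d_A^T \tilde{X}_A^T \tilde{X}_A d_A - 2\lambda \, (d_A^T \mathbf{1} - 1), \\
\nabla_{d_A} \mathcal{L} = 0
  &\;\Longrightarrow\; \tilde{X}_A^T \tilde{X}_A d_A = \lambda \mathbf{1}
  \;\Longrightarrow\; d_A = \lambda \, (\tilde{X}_A^T \tilde{X}_A)^{-1} \mathbf{1}, \\
\tilde{X}_A^T r = C \mathbf{1}
  &\;\Longrightarrow\; (\tilde{X}_A^T \tilde{X}_A)^{-1} \tilde{X}_A^T r
   = C \, (\tilde{X}_A^T \tilde{X}_A)^{-1} \mathbf{1} \;\propto\; d_A .
\end{align*}
```

The last line says that the least squares coefficient of $r$ on $\tilde{X}_A$ is proportional to the minimizer $d_A$, which is why the optimal direction coincides with the lasso move.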

Proof of Theorem 2, monotone lasso.

The increment in the $L_1$ arc-length of the path $\beta^0 + \gamma \cdot d$ (starting from $\beta^0$)
is easily seen to be

$$\ell(\gamma) = \gamma \|d\|_1. \tag{13}$$

Similar to the derivation of (11), we get

$$V(\gamma) = \frac{dR(\gamma)}{d\ell(\gamma)} = \frac{\partial R}{\partial \gamma} \Big/ \frac{\partial \ell}{\partial \gamma} = -2\, \frac{d^T \tilde{X}^T \big(y - \tilde{X}(\beta^0 + \gamma \cdot d)\big)}{\|d\|_1}.$$

This is minimized by selecting $d_j \ge 0$ for $j \in A$, again with minimizing value
$-2C$. The Hessian is

$$\frac{d^2 R}{d\ell^2} = \frac{\partial V}{\partial \gamma} \Big/ \frac{\partial \ell}{\partial \gamma} = 2\, \frac{d^T \tilde{X}^T \tilde{X} d}{\|d\|_1^2},$$

which we would like to minimize subject to $d_j \ge 0$. This is equivalent to the
optimization problem

$$\min_d \; d^T \tilde{X}_A^T \tilde{X}_A d \quad \text{subject to } d_j \ge 0, \ \sum_{j \in A} d_j = 1.$$

It is straightforward to show via the KKT conditions for this quadratic
programming problem that the solution is identical to the solution for $\rho$ in (53) in
the appendix, which is the direction given by a non-negative least-squares fit of
$r$ to $\tilde{X}_A$ (the forward-stagewise move). □

The graphs in Figure 3 suggest that the gap is bigger as a function of
arc-length than norm. This is in fact the case, as can be seen in the proof of
Theorem 2. As a function of norm, starting from the same point, the downward
gradient is the same for both lasso and FS$_0$, but the Hessian is smaller for lasso.
As a function of arc-length, the gradient for lasso can be larger than for FS$_0$, if
some of the $d_j$ are negative.

Armed with the lasso and monotone lasso move directions from Definition 2,
we can now characterize paths as solutions to differential equations.
Definition 3. The monotone lasso coefficient path $\beta(\ell)$ for a dataset $\tilde{X} =
\{X, -X\}$ is the solution to the differential equation

$$\frac{\partial \beta}{\partial \ell} = \rho_{ml}(\beta(\ell)), \tag{19}$$

with initial condition $\beta(0) = 0$. Since the right-hand side of (19) is piecewise
continuous, this path is continuous and piecewise differentiable.

Because the directions $\rho_{ml}(\beta)$ defined in (6) are standardized to have unit
$L_1$ norm, the solution curve is unit $L_1$-speed, and hence is parametrized by $L_1$
arc length.

In order to solve (19), we need to track the entire path; this solution is
provided by the forward-stagewise algorithm.

Proposition 1. The forward-stagewise algorithm for a dataset $\tilde{X} = \{X, -X\}$
and square-error loss computes the monotone lasso path $\beta(\ell)$; it starts at 0, and
then increments the coefficients continuously according to the monotone lasso
moves. Specifically:

Initialize: Set $\beta(0) = 0$, $\ell_0 = 0$, and $\rho_0 = \rho_{ml}(0)$, with corresponding active set
$A_0$.

For $j = 0, 1, 2, \ldots$

1. Let $\beta(\ell) = \beta(\ell_j) + (\ell - \ell_j) \cdot \rho_j$, $\ell \in [\ell_j, \ell_{j+1}]$, where $\ell_{j+1}$ is the value
of $\ell > \ell_j$ at which $A_j$ changes to $A_{j+1}$.

2. Compute $\rho_{j+1} = \rho_{ml}(\beta(\ell_{j+1}))$.

3. If $\rho_{j+1} = 0$ exit, and $\beta(\ell)$ is defined on $[0, L]$, with $L = \ell_{j+1}$.

Proposition 1 follows from Theorem 1. We can characterize the lasso path in
a similar fashion.

Proposition 2. The lasso coefficient path $\beta(\ell)$ for a dataset $\tilde{X} = \{X, -X\}$ is
the solution to the differential equation

$$\frac{\partial \beta}{\partial \ell} = \rho_l(\beta(\ell)), \tag{20}$$

with initial condition $\beta(0) = 0$. Since the right-hand side of (20) is piecewise
continuous, this path is continuous and piecewise differentiable.

The normalization of $\rho_l$ defined in (5) guarantees that the solution path is
parametrized by $L_1$ norm (since the coefficients are non-negative).

The characterizations above draw on the similarities between the lasso and 
monotone lasso. The characterization of the monotone lasso falls slightly short 
of that of the lasso for the following reasons. 

⁴Since the lasso path is defined in Tibshirani (1996) as the solution to a convex optimization
problem, this alternative characterization is a proposition (unlike Definition 3), and follows
from Theorem 1.
• We can define a lasso solution explicitly at any given point on the path,
as the solution to an optimization problem (2); we are unable to do this
for the monotone lasso.

• When $p < n$, both the lasso and monotone-lasso paths end in the
unrestricted least-squares solution. When $p > n$, any least squares solution
has zero residuals, with infinitely many solution coefficients $\beta$. The lasso
path leads to the unique zero-residual solution having minimum $L_1$ norm.
By construction the monotone lasso path also produces a unique
zero-residual solution in these circumstances, but we are unable to characterize
it further.



4. Forward Stagewise for General Convex Loss Functions 



Gradient boosting (Friedman 2001, Hastie et al. 2001) is often used with loss
functions other than squared error; typical candidates are the binomial
log-likelihood or the "Adaboost" loss for binary classification problems. Our
linear-model simplification is also applicable there. As a concrete example, consider
the linear logistic regression model in the expanded space:

$$\log \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} = \beta_0 + \sum_{j=1}^{p} x_j \beta_j^+ - \sum_{j=1}^{p} x_j \beta_j^-.$$

The negative of the binomial log-likelihood is given by

$$L(\beta) = -\sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big],$$

where $p_i = \Pr(y = 1 \mid x_i)$ is the fitted probability for case $i$.

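More generally, consider the case where we have a linear model $\eta(x) = \tilde{x}^T \beta$,
and a loss function of the form

$$L(\beta) = \sum_{i=1}^{n} l\big(y_i, \eta(\tilde{x}_i)\big), \qquad \eta(\tilde{x}_i) = \tilde{x}_i^T \beta. \tag{25}$$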

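The analog of Algorithm 3 for this general case is given in Algorithm 4.
For the binomial case, the negative gradient in step 2 is $-\partial L / \partial \beta_j = x_j^T (y - p)$,
where $y$ is the 0/1 response vector, and $p$ the vector of fitted probabilities.

We can apply the same logic used in forward stagewise with squared error
loss in this situation, by using a quadratic approximation to the loss at the
current $\beta^0$:

$$L(\beta) \approx L(\beta^0) + \frac{\partial L}{\partial \beta}\Big|_{\beta^0}^T (\beta - \beta^0) + \frac{1}{2} (\beta - \beta^0)^T \frac{\partial^2 L}{\partial \beta \, \partial \beta^T}\Big|_{\beta^0} (\beta - \beta^0). \tag{26}$$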
Algorithm 4 Generalized Monotone Incremental Forward Stagewise Regression

1. Start with $\beta_1, \beta_2, \ldots, \beta_{2p} = 0$.

2. Find the predictor $x_j$ with largest negative gradient element $-\partial L / \partial \beta_j$,
evaluated at the current predictor $\eta$.

3. Update $\beta_j \leftarrow \beta_j + \epsilon$.

4. Update the predictors $\eta(x_i) = \tilde{x}_i^T \beta$, and repeat steps 2 and 3 many times.
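For the binomial case, Algorithm 4 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' code: the intercept is suppressed, the names `logistic_monotone_stagewise` and `neg_log_likelihood` are hypothetical, and the loop stops as soon as a step of size $\epsilon$ would no longer decrease the loss, which stands in for "repeat many times".

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def neg_log_likelihood(eta: List[float], y: List[float]) -> float:
    """Negative binomial log-likelihood, written as sum of log(1+e^eta) - y*eta."""
    return sum(math.log1p(math.exp(e)) - yi * e for e, yi in zip(eta, y))


def logistic_monotone_stagewise(columns: List[List[float]], y: List[float],
                                eps: float = 0.01,
                                max_steps: int = 10000) -> List[float]:
    """Generalized monotone forward stagewise for logistic loss (no intercept).

    Returns the 2p expanded coefficients for the data [X : -X].
    """
    p = len(columns)
    expanded = columns + [[-v for v in x] for x in columns]
    beta = [0.0] * (2 * p)                        # step 1
    eta = [0.0] * len(y)
    loss = neg_log_likelihood(eta, y)
    for _ in range(max_steps):
        probs = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        resid = [yi - pi for yi, pi in zip(y, probs)]
        # step 2: negative gradient elements -dL/dbeta_j = x_j^T (y - p)
        neg_grad = [dot(x, resid) for x in expanded]
        j = max(range(2 * p), key=lambda k: neg_grad[k])
        if neg_grad[j] <= 0.0:
            break
        new_eta = [e + eps * xi for e, xi in zip(eta, expanded[j])]
        new_loss = neg_log_likelihood(new_eta, y)
        if new_loss >= loss:                      # assumed stopping rule
            break
        beta[j] += eps                            # step 3
        eta, loss = new_eta, new_loss             # step 4
    return beta
```

The collapsed coefficients are again the paired differences, for example `beta[0] - beta[1]` when there is a single predictor.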



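The two derivatives in this case are

$$\frac{\partial L}{\partial \beta}\Big|_{\beta^0} = \tilde{X}^T u^0, \tag{27}$$

$$\frac{\partial^2 L}{\partial \beta \, \partial \beta^T}\Big|_{\beta^0} = \tilde{X}^T W^0 \tilde{X}, \tag{28}$$

where

$$u_i^0 = \frac{\partial l(y_i, \eta_i)}{\partial \eta_i}\Big|_{\eta_i = \tilde{x}_i^T \beta^0} \tag{29}$$

and the diagonal matrix $W^0$ has entries

$$W_{ii}^0 = \frac{\partial^2 l(y_i, \eta_i)}{\partial \eta_i^2}\Big|_{\eta_i = \tilde{x}_i^T \beta^0}. \tag{30}$$

In the case of logistic regression $W_{ii}^0 = p_i (1 - p_i)$, where $p_i$ are the current
probabilities, and $u_i^0 = -(y_i - p_i)$. Minimizing (26) gives the Newton update

$$\delta = \beta - \beta^0 = -(\tilde{X}^T W^0 \tilde{X})^{-1} \tilde{X}^T u^0, \tag{31}$$

which can be expressed as the coefficients from a weighted least squares fit of
$-(W^0)^{-1} u^0$ on $\tilde{X}$, with weights $W^0$.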

Definition 4. The monotone lasso move direction $\rho_{ml}(\beta^0, L)$ at a point $\beta^0$, with
expanded data $\tilde{X}$, and with loss function $L$ is:

1. Compute $\tilde{X}^T u^0$; if all elements are zero, return $\rho = 0$.

2. Establish the active set $A$ of indices for which $-x_j^T u^0 = \max_k \, -x_k^T u^0$.

3. Let $\delta$ be the coefficients from a weighted, positive, least squares fit of
$-(W^0)^{-1} u^0$ on $\tilde{X}_A$, with weights $W^0$.

4. Define

$$\rho_j = \begin{cases} \delta_j / \sum_{k \in A} \delta_k & j \in A \\ 0 & \text{otherwise.} \end{cases} \tag{32}$$

It is easy to check that this definition coincides with the definition for
squared-error loss. Unlike there, it will in general not be piecewise constant. We can
expect it to be piecewise smooth, with breaks when the active sets change.
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Definition 5. The monotone lasso coefficient path f3{£) for a dataset X = 
{X, —X} and loss L is the solution to the differential equation 

^ = p^i{P{e),L), (33) 

with initial condition /?(0) = 0. 

The definitions are exactly analogous for generalizations of the lasso. 

Unlike for squared error loss, the solution paths are in general piecewise
smooth (but nonlinear), and so efficient exact path algorithms are not available.
Rosset (2005) shows that as long as the loss function is quadratic, piecewise
linear or a mixture of both, then the paths will be piecewise linear, and can be
tracked.

For these general convex-loss lasso problems, lasso solutions are always
available at any point along the path. Park & Hastie (2006) develop efficient
algorithms for obtaining the lasso path for the generalized linear model family
of loss functions (including logistic regression).

For the monotone lasso and general loss function, we have no exact algorithms
for tracking the path, and hence for finding solutions at any point on the
path. Friedman & Popescu (2004), however, have developed efficient
$\varepsilon$-stepping algorithms for finding forward-stagewise solutions for a
variety of losses.
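To make the idea of $\varepsilon$-stepping concrete, here is a minimal sketch of incremental forward stagewise for logistic loss on the expanded data $\{X, -X\}$. It assumes NumPy, and it is not the Friedman & Popescu procedure: the coordinate-selection rule (most negative gradient component), the step size and the simulated data are assumptions made for illustration.

```python
import numpy as np

def logistic_loss(X_tilde, y, beta):
    """Negative log-likelihood: sum_i log(1 + exp(eta_i)) - y_i * eta_i."""
    eta = X_tilde @ beta
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))

def stagewise_logistic(X, y, eps=0.01, n_steps=500):
    """Incremental forward stagewise with non-negative moves on the expanded data."""
    X_tilde = np.hstack([X, -X])
    beta = np.zeros(X_tilde.shape[1])          # coefficients only ever increase
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))
        grad = X_tilde.T @ (-(y - p))          # dL/dbeta = X^T u
        j = int(np.argmin(grad))               # coordinate with steepest descent
        if grad[j] >= 0:
            break                              # no descent direction remains
        beta[j] += eps
    return X_tilde, beta

# Illustrative data with one informative predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_prob = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = (rng.random(200) < true_prob).astype(float)

X_tilde, beta_tilde = stagewise_logistic(X, y)
loss_start = logistic_loss(X_tilde, y, np.zeros_like(beta_tilde))
loss_end = logistic_loss(X_tilde, y, beta_tilde)
```

Because every move is a non-negative increment of size `eps`, the sum of the expanded coefficients equals the $L_1$ arc-length traveled, which is the index used for the monotone lasso path.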



5. Discussion of Criteria 

In conducting this research, we had several interesting false starts in terms of 
finding a criterion for forward stagewise. We briefly discuss some of these here. 

We saw in the top right panel of Figure 3 on page 12 the residual sum of
squares

$$\mathrm{RSS}(\ell) = \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j(\ell) \Big)^2$$

for each of the three methods, as a function of their $L_1$ arc-length $\ell$. The curve
for the forward stagewise sequence always lies below the curves for the other
two methods. We were able to show a local optimality for forward stagewise in
Theorem 2.

We initially had thought that forward stagewise might enjoy a global opti- 
mality criterion like the lasso. 

Candidate criterion 1: For each $L_1$ arc-length $\ell$, the forward stagewise
coefficient $\beta(\ell)$ minimizes $\mathrm{RSS}(\ell)$.

This is true if the lasso paths are monotone, because then the two procedures
coincide, as do $L_1$ arc-length and $L_1$ norm. But in general it is not the case.
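The distinction between $L_1$ arc-length and $L_1$ norm is easy to see on a discretized path. The sketch below assumes NumPy; the helper `l1_arc_length` and the two toy paths are illustrative only.

```python
import numpy as np

def l1_arc_length(path):
    """L1 arc-length of a discretized coefficient path (rows are successive beta values)."""
    return float(np.abs(np.diff(path, axis=0)).sum())

# A monotone path: each coordinate only moves away from zero
monotone = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
# A path in which the first coordinate turns back
turning = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])

for path in (monotone, turning):
    print(l1_arc_length(path), float(np.abs(path[-1]).sum()))
```

For the monotone path the two quantities agree; for the path that turns back, the arc-length exceeds the $L_1$ norm of the end point, which is why the two indexings differ once lasso coefficients change direction.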

Lemma 2. In general there does not exist a coefficient profile that for all $\ell$
minimizes $\mathrm{RSS}(\ell)$ over the set of curves having $L_1$ arc-length at most $\ell$.

Proof. For any $\ell$ construct the "unit speed" coefficient path from the origin to
the lasso solution for that $\ell$. This has $L_1$ arc-length and $L_1$ norm equal to $\ell$, and
hence has the minimum value of $\mathrm{RSS}(\ell)$ over all curves having $L_1$ arc-length $\ell$.
Thus any solution to our problem must agree with the lasso solution for all $\ell$.
From the right-hand panel of Figure 3 this is not the case when the lasso and
forward stagewise profiles are different. □

Another attempt at a global formulation of the problem used the
integrated loss. Define the set of monotone-increasing functions

$$\mathcal{M}_L = \{\beta : [0, L] \to \mathbb{R}^{2p} \mid \beta(0) = 0,\ TV(\beta, \ell) \le \ell\ \ \forall \ell \le L,\ \beta_j(\ell)\ \text{non-decreasing}\},$$

having arc-length at most $\ell$ up to the point $\ell$, for all $\ell \le L$. Since the class
is monotone, the arc-length is the $L_1$ norm, so we are asking for a monotone
path that simultaneously solves a sequence of lasso problems, subject to that
constraint.



Fig 4. A simulation provides a counterexample to candidate criterion 2. Shown is the
difference between mean cumulative RSS for the exact solution to criterion 2 and forward
stagewise (the former computed on a discretized set of 40 values for arc-length), plotted
against L1 arc-length. Initially forward stagewise wins, only to be overtaken by the exact
solution.



Candidate criterion 2: The forward stagewise algorithm minimizes the
integrated residual sum of squares over the monotone class $\mathcal{M}_L$:

$$\beta^{(L)} = \arg\min_{\beta \in \mathcal{M}_L} \int_0^L \sum_i \Big( y_i - \sum_j \tilde{x}_{ij}\beta_j(\ell) \Big)^2 d\ell. \tag{34}$$

As the integrated loss is a continuous, strictly convex functional on $\mathcal{M}_L$,
there exists a unique optimal path that solves (34). However it turns out that
the forward stagewise solution is not always the optimal path.






We computed the exact solution to (34) for our simulation example, at a
discretized sequence of 40 values for arc-length. Figure 4 compares the results
with the forward stagewise solution at these same points. We compute for each
the cumulative mean RSS, and plot their difference (exact minus $FS_0$). In keeping
with its greedy nature, $FS_0$ initially wins, only to be overtaken by the exact
procedure. Hence $FS_0$ does not in general optimize criterion 2.



6. Monotonicity of Profiles 

We have yet to say how the example of Figure 1 was generated. The data were
generated from the model

$$Y = \sin(6X)/(1 + X) + Z/4, \tag{35}$$

with $X$ taking 300 equally spaced values in $[0, 1]$ and $Z \sim N(0, 1)$. The 10
predictors were piecewise linear basis functions $(x - t_k) \cdot I(x > t_k)$, for each of
the knots $\{t_k\}_1^{10} = \{0.0, 0.1, 0.2, \ldots, 0.9\}$. Figure 5 shows the successive
approximations to $\sin(6x)/(1 + x)$ by the different methods, for five equally spaced
solutions along their paths. Despite the differences in their coefficient profiles,
the fits appear to be quite similar. The last column of Figure 5 uses piecewise
constant basis functions $I(x > t_k)$ in place of the piecewise linear ones
$(x - t_k) \cdot I(x > t_k)$. Figure 6 shows their coefficient profiles. Notice that all
profiles are monotone, and hence the profiles for all three algorithms coincide.

The fact that they are the same under monotonicity is not a coincidence, and 
follows from their definitions. In that case there are no zero-crossing events, and 
then LAR and lasso coincide. In addition, monotonicity means that positive 
coefficients are never decreased and vice-versa, hence the non-negative least 
squares move in the forward stagewise procedure is the same as the least squares 
move in LAR. 

Hence it is useful to characterize situations in which the coefficient profiles
are monotone. Let $X$ denote the $N \times p$ matrix of standardized predictors, and
let $X_{\mathcal{A}}$ denote a subset of the columns of $X$, each multiplied by a set of arbitrary
signs $s_1, s_2, \ldots, s_{|\mathcal{A}|}$. Finally, let $S_{\mathcal{A}}$ be a diagonal matrix of the $s_j$ values. The
results of Efron et al. (2004) show that a necessary and sufficient condition for
every path to be monotone is

$$S_{\mathcal{A}} (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1} S_{\mathcal{A}} 1 \ge 0 \quad \text{for all } \mathcal{A} \subseteq \{1, \ldots, p\} \text{ and } S_{\mathcal{A}}. \tag{36}$$

In other words, for all subsets of predictors and sign changes of those predictors,
the inverse covariance matrix must be diagonally dominant (this means that
each diagonal element is at least as big as the sum of the absolute values of the
other elements in its row).

For the piecewise-linear basis functions, it is easy to find a violation of (36):
$\mathcal{A} = \{4, 10, 9\}$ (the first three variables entered in Figure 1), with
$S_{\mathcal{A}} = \mathrm{diag}(-1, 1, 1)$, gives some negative entries. Condition (36) clearly holds
for any orthogonal basis, such as the Haar basis for piecewise constant fits.
However, our piecewise constant basis is not orthogonal. We prove the following
theorem in the appendix.

Fig 5. Successive approximations to sin(6x)/(1 + x) for five equally spaced solutions along
their paths, for the example of Figure 1. The first three columns use piecewise linear bases;
the last column uses piecewise constant bases, and the three methods coincide.

Fig 6. Coefficient profiles for the same data as Figure 1, except that we have used piecewise
constant basis functions. The coefficients are monotone, and lasso, $FS_0$ and LAR coincide.

Theorem 3. Condition (36) holds for piecewise constant bases, and hence the
lasso, $FS_0$ and LAR solutions coincide.
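Condition (36) can be checked numerically for the two bases of this section. The sketch below assumes NumPy and takes "standardized" to mean centered columns scaled to unit norm (an assumption). It uses the diagonal-dominance reading given in the text: requiring (36) for every sign matrix is the same as requiring each diagonal entry of $(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}$ to be at least the sum of the absolute off-diagonal entries in its row. The function names and the tolerance are illustrative.

```python
import itertools
import numpy as np

def standardize(X):
    """Center each column and scale it to unit Euclidean norm."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def all_paths_monotone(X, tol=1e-8):
    """Check diagonal dominance of (X_A^T X_A)^{-1} for every non-empty subset A."""
    p = X.shape[1]
    for size in range(1, p + 1):
        for A in itertools.combinations(range(p), size):
            XA = X[:, list(A)]
            M = np.linalg.inv(XA.T @ XA)
            d = np.diag(M)
            off = np.abs(M).sum(axis=1) - np.abs(d)
            if np.any(d - off < -tol * d.max()):
                return False
    return True

x = np.linspace(0.0, 1.0, 300)            # 300 equally spaced values in [0, 1]
knots = np.arange(10) / 10.0              # 0.0, 0.1, ..., 0.9
step = (x[:, None] > knots).astype(float) # piecewise constant basis I(x > t_k)
hinge = (x[:, None] - knots) * step       # piecewise linear basis (x - t_k) I(x > t_k)

constant_ok = all_paths_monotone(standardize(step))
linear_ok = all_paths_monotone(standardize(hinge))
print(constant_ok, linear_ok)
```

With 10 predictors there are only 1023 non-empty subsets, so exhaustive enumeration is cheap here; it would not scale to the 1000-variable simulation of the next section.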

7. Lasso versus Forward Stagewise: Which is Better? 

As discussed in Section 2, the current interest in $FS_\varepsilon$ is because of its connection
to least squares boosting. By understanding its properties in this simplified
setting, we hope to learn more about the regularization path of boosting.

The results of this paper show that forward stagewise behaves like a monotone
version of the lasso, and is locally optimal with regard to $L_1$ arc-length. This is
in contrast to the lasso, which is less constrained.

This raises the question: with a large number of predictors, which algorithm is
better? The monotone lasso will tend to slow down the search, not allowing the
sudden changes of direction that can occur with the lasso. Is this a good thing?

To investigate this, we carried out a simulation study. The data consist of
$N = 60$ observations on each of $p = 1000$ (Gaussian) variables, strongly
correlated ($\rho = 0.95$) in groups of 20. The true model has nonzero coefficients for 50
variables, one drawn from each group, and the coefficient values themselves are
drawn from a standard Gaussian. Finally Gaussian noise is added with variance
$\sigma^2 = 36$, resulting in a noise-to-signal ratio of about 0.72. See Appendix A.3
for more details.

The grouping of the variables is intended to mimic the correlations of nearby

Fig 7. Comparison of lasso and forward stagewise paths on simulated regression data. The
number of samples is 60 and the number of variables 1000. The forward-stagewise paths
fluctuate less than those of lasso in the final stages of the algorithms. Both paths are indexed
by $L_1$-norm, scaled as a fraction of the $L_1$-norm at the end of the path, $\|\beta(L)\|_1$.



Figure 7 shows the coefficient paths for lasso and forward stagewise for a
single realization from this model.

Here the coefficient profiles are similar only in the early stages of the paths.
For the later stages, the forward stagewise paths are much smoother (in fact
exactly monotone here), while those for the lasso fluctuate widely. This is due
to the strong correlations among subsets of the variables.

The test-error performance of the two models is rather similar (Figure 8), and
they achieve about the same minimum. In the later stages forward stagewise
takes longer to overfit, a likely consequence of the smoother paths. We are using
$\|\beta(\ell)\|_1$ to index both curves for this plot, which would look quite different if
instead we used arc-length. Since the forward stagewise path is monotone here,
the $L_1$ norm is arc-length, so in a sense both MSE profiles are measured using
their appropriate index.

On a more theoretical note, Bühlmann (2006) proves consistency of forward
stagewise for high-dimensional linear models. See also Tropp (2004) and Tropp
(2006) for comparisons of lasso and forward stagewise regression.

We conclude that for problems with large numbers of correlated predictors,
the forward stagewise procedure and its associated $L_1$ arc-length criterion might
be preferable to the lasso and $L_1$ norm criterion. This suggests that for general
boosting-type applications, the incremental forward stagewise algorithms which
are currently used might be preferable to algorithms that try to solve the
equivalent lasso problem.

Fig 8. Mean squared error for lasso and forward stagewise on the simulated data. Despite
the difference in the coefficient paths, the two models perform similarly over the critical part
of the regularization path. In the right tail, lasso appears to overfit more rapidly.



Appendix A: Appendix 
A.1. Proof of Theorem 1



Part 1. The Lagrangian corresponding to the lasso problem in its expanded form is

$$\frac{1}{2}\sum_{i=1}^{N} \Big[ y_i - \beta_0 - \Big( \sum_{j=1}^{p} x_{ij}\beta_j^+ - \sum_{j=1}^{p} x_{ij}\beta_j^- \Big) \Big]^2 + \lambda \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) - \sum_{j=1}^{p} \lambda_j^+\beta_j^+ - \sum_{j=1}^{p} \lambda_j^-\beta_j^-, \tag{37}$$

with KKT conditions (for each $j$):

$$-x_j^T r + \lambda - \lambda_j^+ = 0, \tag{38}$$
$$x_j^T r + \lambda - \lambda_j^- = 0, \tag{39}$$
$$\lambda_j^+ \beta_j^+ = 0, \tag{40}$$
$$\lambda_j^- \beta_j^- = 0. \tag{41}$$

Here $r = y - \beta_0 1 - \sum_j x_j\beta_j$, with $\beta_j = \beta_j^+ - \beta_j^-$, is the residual vector. From
these we can deduce the following aspects of the solution:

1. If $\lambda = 0$, then $x_j^T r = 0\ \forall j$, and the solution corresponds to the unrestricted
least-squares fit.

2. $\beta_j^+ > 0,\ \lambda > 0 \implies \lambda_j^+ = 0 \implies x_j^T r = \lambda > 0 \implies \lambda_j^- > 0 \implies \beta_j^- = 0$.

3. Likewise $\beta_j^- > 0,\ \lambda > 0 \implies \beta_j^+ = 0$. Hence 2 and 3 give the intuitive result
that only one of the pair $(\beta_j^+, \beta_j^-)$ can be positive at any time.

4. $|x_j^T r| \le \lambda$.

5. If $\beta_j^+ > 0$, then $x_j^T r = \lambda$, or if $\beta_j^- > 0$, then $-x_j^T r = \lambda$.

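Since $\beta^0$ is on the lasso path, there exists a $\lambda = \lambda_0$ such that $\beta^0 = \beta(\lambda_0)$. Define
the active set $\mathcal{A}$ to be the set of indices of variables in $\tilde{X}$ with positive coefficients at
$\lambda = \lambda_0$, and assume at $\lambda_1 = \lambda_0 - \Delta$ for suitably small $\Delta > 0$ this set has not changed.
Define $\beta_{\mathcal{A}}(\lambda)$ to be the corresponding coefficients at $\lambda$. Then from deduction 5

$$\tilde{X}_{\mathcal{A}}^T (y - \tilde{X}_{\mathcal{A}}\beta_{\mathcal{A}}(\lambda)) = \lambda 1 \quad \text{for } \lambda \in [\lambda_1, \lambda_0]. \tag{42}$$

Hence

$$\tilde{X}_{\mathcal{A}}^T \tilde{X}_{\mathcal{A}} (\beta_{\mathcal{A}}(\lambda_1) - \beta_{\mathcal{A}}(\lambda_0)) = \Delta 1, \tag{43}$$

or

$$\beta_{\mathcal{A}}(\lambda_1) - \beta_{\mathcal{A}}(\lambda_0) = \Delta \cdot (\tilde{X}_{\mathcal{A}}^T \tilde{X}_{\mathcal{A}})^{-1} 1. \tag{44}$$

So while $\mathcal{A}$ remains constant, the coefficients $\beta_{\mathcal{A}}(\lambda)$ change linearly, according to (44).
Since $r = y - \tilde{X}_{\mathcal{A}}\beta_{\mathcal{A}}(\lambda_0)$ and $\tilde{X}_{\mathcal{A}}^T r = \lambda_0 1$ from (42),

$$\beta_{\mathcal{A}}(\lambda_1) - \beta_{\mathcal{A}}(\lambda_0) = \frac{\Delta}{\lambda_0} \cdot (\tilde{X}_{\mathcal{A}}^T \tilde{X}_{\mathcal{A}})^{-1} \tilde{X}_{\mathcal{A}}^T r \tag{45}$$

as claimed.

The active set $\mathcal{A}$ will change if a variable "catches up" (in terms of deduction 4), in
which case it is augmented and the direction (44) is recalculated. It will also change if
a coefficient attempts to pass through zero, in which case it is removed from $\mathcal{A}$. □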

Part 2. At each step, the monotone incremental forward stagewise algorithm
(Algorithm 3) selects the variable having largest correlation with the residuals, and moves
its coefficient up by $\varepsilon$. There may be a set $\mathcal{A}$ of variables competing for this maximal
correlation, and a succession of $N$ such moves can be divided up according to the
proportions $\rho_j$ of moves that augmented variable $j$'s coefficient. In the limit as $\varepsilon$
decreases and $N$ increases such that $N\varepsilon = \epsilon$, we can expect an active set $\mathcal{A}$ of variables
tied in terms of the largest correlation, and a sequence of moves of total $L_1$ arc-length
$\epsilon$ distributed among this set with proportions $\rho_{\mathcal{A}}$. Efron et al. (2004) showed in their
Lemma 11 that for sufficiently small $\epsilon$, this set would not change. Suppose
$\tilde{X}_{\mathcal{A}}^T r = c1$ for some $c$, reflecting the equal correlations.

The limiting sequence of moves $\epsilon\rho_{\mathcal{A}}$ must have positive components, must maintain
the equal correlation with the residuals, and subject to these constraints should
decrease the residual sum-of-squares as fast as possible.






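Consider the optimization problem

$$\min_{\rho} \tfrac{1}{2}\|r - \epsilon\tilde{X}_{\mathcal{A}}\rho\|_2^2 \quad \text{subject to } \rho_j \ge 0,\ \sum_{j \in \mathcal{A}} \rho_j = 1, \tag{46}$$

which captures the first and third of these requirements. The Lagrangian is

$$L(\rho, \gamma, \lambda) = \tfrac{1}{2}\|r - \epsilon\tilde{X}_{\mathcal{A}}\rho\|_2^2 - \sum_{j \in \mathcal{A}} \gamma_j\rho_j + \lambda\Big(\sum_{j \in \mathcal{A}} \rho_j - 1\Big), \tag{47}$$

with KKT conditions

$$-\epsilon\tilde{X}_{\mathcal{A}}^T (r - \epsilon\tilde{X}_{\mathcal{A}}\rho) - \gamma + \lambda 1 = 0, \tag{48}$$

$$\gamma_j \ge 0, \quad \rho_j \ge 0, \quad \gamma_j\rho_j = 0. \tag{49}$$

Here $\gamma$ is a vector with components $\gamma_j$, $j \in \mathcal{A}$. Note that for $\rho_j > 0$, $\gamma_j = 0$, and hence
(48) shows that the correlations with the residual remain equal, which was the second
requirement above.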

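Consider a second optimization problem (the one in the statement of the theorem):

$$\min_{\theta} \tfrac{1}{2}\|r - \tilde{X}_{\mathcal{A}}\theta\|_2^2 \quad \text{subject to } \theta_j \ge 0. \tag{50}$$

The corresponding KKT conditions are

$$-\tilde{X}_{\mathcal{A}}^T (r - \tilde{X}_{\mathcal{A}}\theta) - \nu = 0, \tag{51}$$

$$\nu_j \ge 0, \quad \theta_j \ge 0, \quad \nu_j\theta_j = 0. \tag{52}$$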

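We now show that $\rho = \theta/\|\theta\|_1$ solves (46) for all $\epsilon > 0$, where $\theta$ is the solution
to (50). From (48) we get

$$\epsilon^2\tilde{X}_{\mathcal{A}}^T\tilde{X}_{\mathcal{A}}\rho = (-\lambda + \epsilon c)1 + \gamma, \tag{53}$$

and from (51)

$$\tilde{X}_{\mathcal{A}}^T\tilde{X}_{\mathcal{A}}\theta = c1 + \nu. \tag{54}$$

With $s = \|\theta\|_1 = \sum_j \theta_j$, we multiply (54) by $\epsilon^2/s$ to get

$$\epsilon^2\tilde{X}_{\mathcal{A}}^T\tilde{X}_{\mathcal{A}}\rho = \frac{\epsilon^2}{s}(c1 + \nu) \tag{55}$$

$$= (-\lambda^* + \epsilon c)1 + \gamma^*, \tag{56}$$

where

$$\gamma^* = \frac{\epsilon^2}{s}\nu, \qquad \lambda^* = \epsilon c\Big(1 - \frac{\epsilon}{s}\Big).$$

It is easy to check, using (52), that $(\rho, \gamma^*, \lambda^*)$ satisfy (48)-(49).

Variables with $\theta_j = 0$ may drop out of the active set, since from (48) and (49),
if $\gamma_j > 0$, for $\epsilon > 0$ their correlation will decrease faster than those with positive
coefficients.

This direction is pursued until a variable not in $\mathcal{A}$ "catches up" in terms of
correlation, at which point the procedure stops, $\mathcal{A}$ is updated, and the direction is
recomputed. □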



A.2. Proof of Theorem 3

We need to verify that when using piecewise constant basis functions, (36) holds for
every $\mathcal{A}$ and every sign matrix $S_{\mathcal{A}}$.

Suppose we use $k$ piecewise constant basis functions with knots $t_1 < \cdots < t_k$. Let

$$n_j = \#\{i : x_i > t_j\}$$

be the number of observed $x$'s to the right of the $j$-th knot. Without loss of generality,
we assume that each $n_j > 0$, otherwise that predictor contributes nothing to the model
as all observed $x$'s are either to the right or the left of that knot.

A simple calculation shows that, for $i < j$, after normalizing the columns of $X$,

$$(X^T X)_{ij} = \sqrt{\frac{n_j(n - n_i)}{n_i(n - n_j)}},$$

which is the covariance function of a Brownian bridge $(B_s)_{0 \le s \le 1}$, normalized to have
unit variance, at the time points $0 < s_1 < \cdots < s_k < 1$,

$$s_j = \frac{n_{k-j+1}}{n}.$$

If we can prove that $(X^T X)^{-1}$ is diagonally dominant for every $k$ and every choice of
knots, then, as every principal minor is of exactly the same form (with a smaller $k$ and
fewer knots), we will also have proved that $(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}$ is diagonally dominant, hence
that the lasso paths are monotone.

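We prove that $(X^T X)^{-1}$ is diagonally dominant by computing $(X^T X)^{-1}$. One way
to compute $(X^T X)^{-1}$ is to compute the density of

$$\Big( \frac{B_{s_1}}{\sqrt{s_1(1 - s_1)}}, \ldots, \frac{B_{s_k}}{\sqrt{s_k(1 - s_k)}} \Big)$$

and read off the inverse from the exponent of the density. Before we turn to computing
the density, we note that

$$\Big( \frac{B_{s_1}}{\sqrt{s_1(1 - s_1)}}, \ldots, \frac{B_{s_k}}{\sqrt{s_k(1 - s_k)}} \Big) \stackrel{D}{=} \Big( \frac{W_{v_1}}{\sqrt{v_1}}, \ldots, \frac{W_{v_k}}{\sqrt{v_k}} \Big),$$

where $W$ is a standard Brownian motion and

$$v_j = \frac{s_j}{1 - s_j}.$$

It is now simple to show that, up to a constant multiple, the exponent of the density
of

$$\Big( \frac{W_{v_1}}{\sqrt{v_1}}, \ldots, \frac{W_{v_k}}{\sqrt{v_k}} \Big)$$

evaluated at $(w_1, \ldots, w_k)$ is

$$w_1^2 + \sum_{i=2}^{k} \frac{(v_i^{1/2} w_i - v_{i-1}^{1/2} w_{i-1})^2}{v_i - v_{i-1}}.$$

Therefore $(X^T X)^{-1}$ is tridiagonal, with elements

$$(X^T X)^{-1}_{jj} = v_j\Big( \frac{1}{v_j - v_{j-1}} + \frac{1}{v_{j+1} - v_j} \Big) = \frac{(v_{j+1} - v_{j-1})v_j}{(v_j - v_{j-1})(v_{j+1} - v_j)}, \qquad (X^T X)^{-1}_{j,j-1} = -\frac{\sqrt{v_j v_{j-1}}}{v_j - v_{j-1}},$$

with the boundary conventions $v_0 = 0$ and $v_{k+1} = \infty$.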



Because the off-diagonal entries of $(X^T X)^{-1}$ are non-positive, we only have to show
that

$$(X^T X)^{-1} 1 \ge 0,$$

as multiplying on the left and right by $S_{\mathcal{A}}$ will only increase the entries of the above
vector.

We must therefore prove that for all $j$,

$$\frac{(v_{j+1} - v_{j-1})v_j}{(v_j - v_{j-1})(v_{j+1} - v_j)} - \frac{\sqrt{v_j v_{j-1}}}{v_j - v_{j-1}} - \frac{\sqrt{v_{j+1} v_j}}{v_{j+1} - v_j} \ge 0.$$

By scaling and combining fractions (set $a = v_{j-1}/v_j$ and $b = v_{j+1}/v_j$), the above is
implied by the following: for every $a < 1 < b$,

$$\sqrt{a} \cdot \frac{b - 1}{b - a} + \sqrt{b} \cdot \frac{1 - a}{b - a} \le 1.$$

It remains therefore to prove that this inequality holds. However, this is just Jensen's
inequality: define a two-point distribution placing mass $(b - 1)/(b - a)$ on $\sqrt{a}$ and mass
$(1 - a)/(b - a)$ on $\sqrt{b}$. Then, if $Z$ is distributed according to this law:

$$E(Z) = \sqrt{a} \cdot \frac{b - 1}{b - a} + \sqrt{b} \cdot \frac{1 - a}{b - a} \le (E(Z^2))^{1/2} = 1.$$

We note that general conditions for monotonicity can be derived. However it is not
clear how these might be verified in practice.



A.3. Simulation Details

Here we give more details of the simulation in Section 7. The regression model has the
form

$$Y = X\beta + \varepsilon, \tag{57}$$

where $X \sim N(0, \Sigma)$. $X$ has 1000 components, correlated in blocks of size 20. Hence
$\Sigma$ is a block-diagonal covariance matrix, with 50 blocks $\Sigma_m$ each of size 20, and each
block has the (identical) form

$$\Sigma_m = (1 - \rho)I_{20} + \rho 1 1^T. \tag{58}$$

We used $\rho = 0.95$ in each block. The 1000-vector $\beta$ was chosen to be sparse, with
only 50 non-zero entries, one per block. Without loss of generality, we picked the
first variable in each block to have a non-zero coefficient, drawn at random from a
standard Gaussian distribution. The noise term $\varepsilon$ was also Gaussian, with variance
$\sigma^2 = 36$. $N = 60$ realizations were drawn from this model. For the noise-to-signal
ratio we compute $\mathrm{var}(\varepsilon)/\mathrm{var}(X\beta)$. Since $X$ and $\beta$ have mean zero, the denominator is
$E(\beta^T \Sigma \beta) = \mathrm{tr}[\Sigma E(\beta\beta^T)] = 50$, hence the ratio is 0.72.
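The simulation design can be reproduced along the following lines. This is a minimal sketch assuming NumPy; the function `simulate`, its argument names and the seed are illustrative choices, not the authors' code. Each block is generated as a shared Gaussian factor plus independent noise, which yields the covariance in (58) without forming the 1000 x 1000 matrix.

```python
import numpy as np

def simulate(n=60, n_blocks=50, block_size=20, rho=0.95, sigma2=36.0, seed=0):
    """Draw (X, y, beta) from model (57) with block-correlated Gaussian predictors."""
    rng = np.random.default_rng(seed)
    p = n_blocks * block_size
    # Within a block, sqrt(1-rho)*z + sqrt(rho)*g has covariance (1-rho) I + rho 1 1^T
    z = rng.normal(size=(n, p))
    g = rng.normal(size=(n, n_blocks))
    X = np.sqrt(1.0 - rho) * z + np.sqrt(rho) * np.repeat(g, block_size, axis=1)
    beta = np.zeros(p)
    beta[::block_size] = rng.normal(size=n_blocks)   # first variable of each block
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
    return X, y, beta

X, y, beta = simulate()

# Population noise-to-signal ratio: var(eps) / E(beta^T Sigma beta) = 36 / 50
expected_ratio = 36.0 / 50.0

# Larger sample with two blocks, used only to check the correlation structure
X_big, _, _ = simulate(n=20000, n_blocks=2)
within = np.corrcoef(X_big[:, 0], X_big[:, 1])[0, 1]
between = np.corrcoef(X_big[:, 0], X_big[:, 20])[0, 1]
```

For a single draw of $\beta$ the realized signal variance $\sum_j \beta_j^2$ will differ from its expectation of 50, so the ratio of 0.72 holds on average over $\beta$ rather than for each realization.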
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